Near Data Processing for Efficient and Trusted Systems
We live in a world which constantly produces data at a rate that only increases with time. Conventional processor architectures fail to process this abundant data efficiently, as they expend significant energy on instruction processing and on moving data over deep memory hierarchies. Furthermore, to process large amounts of data in a cost-effective manner, there is increased demand for remote computation. While cloud service providers have come up with innovative solutions to cater to this increased demand, the security concerns users feel for their data remain a strong impediment to wide-scale adoption.
An exciting technique in our repertoire for dealing with these challenges is near-data processing. Near-data processing (NDP) is a data-centric paradigm which moves computation to where data resides. This dissertation exploits NDP both to process the data deluge we face efficiently and to build low-overhead secure hardware designs.
To this end, we first propose Compute Caches, a novel NDP technique. Simple augmentations to the underlying SRAM design enable caches to perform commonly used operations in place. In-place computation in caches not only avoids excessive data movement over the memory hierarchy, but also significantly reduces instruction-processing energy, as independent sub-units inside caches perform computation in parallel. Compute Caches significantly improve performance and reduce the energy expended for a suite of data-intensive applications.
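The core idea can be illustrated with a small sketch (a hypothetical model for intuition, not the dissertation's actual SRAM circuit): treat each cache sub-array as a unit that applies a bitwise operation to two resident lines in place, so many sub-arrays compute in parallel and no line travels up the memory hierarchy.

```python
# Hypothetical model of the Compute Caches idea, not the dissertation's
# actual SRAM circuit: each sub-array holds two operand lines and applies a
# bitwise operation to them in place, so all sub-arrays compute in parallel
# and no cache line travels up the memory hierarchy.

LINE_BYTES = 64  # a typical cache-line size (assumption)

def in_cache_and(subarrays):
    """Model one in-place operation: every (line_a, line_b) pair lives in
    its own sub-array, and the AND result is produced locally."""
    results = []
    for line_a, line_b in subarrays:
        assert len(line_a) == len(line_b) == LINE_BYTES
        results.append(bytes(a & b for a, b in zip(line_a, line_b)))
    return results

if __name__ == "__main__":
    a = bytes([0b10101010] * LINE_BYTES)
    b = bytes([0b11001100] * LINE_BYTES)
    print(in_cache_and([(a, b)])[0][0])  # 0b10001000 == 136
```

Because every pair is independent, the loop body is what each sub-array would do concurrently in hardware; the host issues one operation instead of streaming both lines through the core.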
Second, this dissertation identifies security advantages of NDP. While the memory bus side channel has received much attention, a low-overhead hardware design which defends against it remains elusive. We observe that smart memory, i.e., memory with compute capability, can dramatically simplify this problem. To exploit this observation, we propose InvisiMem, which uses the logic layer in the smart memory to implement cryptographic primitives that address the memory bus side channel efficiently. Our solutions obviate the need for expensive constructs like Oblivious RAM (ORAM) and Merkle trees, and have one to two orders of magnitude lower overheads in performance, space, energy, and memory bandwidth compared to prior solutions.
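As an illustration of the kind of primitive such a logic layer can host, the sketch below encrypts and integrity-tags each packet crossing the memory bus. This is a generic counter-mode-plus-HMAC construction for intuition only; the paper's exact protocol, packet format, and freshness scheme are not reproduced here.

```python
import hashlib
import hmac

# Illustrative only: InvisiMem's point is that the smart memory's logic layer
# can run cryptographic primitives, so every packet crossing the memory bus
# can be encrypted and integrity-checked at both endpoints. This
# counter-mode-plus-HMAC construction is a generic stand-in, not the paper's
# exact scheme.

def _keystream(key, counter, n):
    """Derive n pseudorandom bytes bound to a per-packet counter."""
    out = b""
    block = 0
    while len(out) < n:
        out += hashlib.sha256(
            key + counter.to_bytes(8, "big") + block.to_bytes(4, "big")
        ).digest()
        block += 1
    return out[:n]

def seal_packet(enc_key, mac_key, counter, payload):
    """Encrypt a bus packet and tag it; the counter guards against replay."""
    ct = bytes(p ^ k for p, k in
               zip(payload, _keystream(enc_key, counter, len(payload))))
    tag = hmac.new(mac_key, counter.to_bytes(8, "big") + ct,
                   hashlib.sha256).digest()
    return ct, tag

def open_packet(enc_key, mac_key, counter, ct, tag):
    """Verify the tag, then decrypt; raises if bus traffic was tampered with."""
    expect = hmac.new(mac_key, counter.to_bytes(8, "big") + ct,
                      hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("packet failed integrity check")
    return bytes(c ^ k for c, k in
                 zip(ct, _keystream(enc_key, counter, len(ct))))
```

A round trip (`open_packet(ek, mk, 7, *seal_packet(ek, mk, 7, data))`) recovers `data`, while flipping any ciphertext bit raises; because both endpoints share keys and counters, per-packet protection needs no tree-structured metadata.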
This dissertation also addresses a related vulnerability, the page fault side channel, in which the Operating System (OS) induces page faults to learn an application's address trace and deduces application secrets from it. To tackle it, we propose Sanctuary, which obfuscates the page fault channel while allowing the OS to manage memory as a resource. To do so, we design a novel construct, Oblivious Page Management (OPAM), which is derived from ORAM but customized for the page-management context. We employ near-memory page moves to reduce OPAM overhead and also propose a novel memory partition to reduce the number of OPAM transactions required. For a suite of cloud applications which process sensitive data, we show that the page fault channel can be tackled at reasonable overheads.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/144139/1/shaizeen_1.pd
Collaborative Acceleration for FFT on Commercial Processing-In-Memory Architectures
This paper evaluates the efficacy of recent commercial processing-in-memory
(PIM) solutions to accelerate fast Fourier transform (FFT), an important
primitive across several domains. Specifically, we observe that efficient
implementations of FFT on modern GPUs are memory bandwidth bound. As such, the
memory bandwidth boost offered by commercial PIM solutions makes a case for PIM
to accelerate FFT. To this end, we first deduce a mapping of FFT computation to
a strawman PIM architecture representative of recent commercial designs. We
observe that even with careful data mapping, PIM is not effective in
accelerating FFT. To address this, we make a case for collaborative
acceleration of FFT with PIM and GPU. Further, we propose software and hardware
innovations which lower PIM operations necessary for a given FFT. Overall, our
optimized PIM FFT mapping, termed Pimacolaba, delivers performance and data
movement savings of up to 1.38× and 2.76×, respectively, over a range of FFT
sizes.
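To see why FFT stresses memory bandwidth, consider a minimal radix-2 Cooley-Tukey implementation (an illustrative sketch, unrelated to the paper's GPU/PIM mapping): each of the log2(n) butterfly stages streams all n points, so an n-point transform moves O(n log n) data while doing only a few flops per point per stage.

```python
import cmath

def fft(x):
    """Iterative radix-2 Cooley-Tukey FFT. Each of the log2(n) stages sweeps
    all n points once, so data movement grows as O(n log n) while arithmetic
    per byte stays low -- the bandwidth-bound behavior noted above."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    x = list(x)
    # Bit-reversal permutation puts the data in butterfly order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly stages: the span doubles each stage, log2(n) stages total.
    size = 2
    while size <= n:
        w_m = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1.0
            for k in range(size // 2):
                a = x[start + k]
                b = x[start + k + size // 2] * w
                x[start + k] = a + b
                x[start + k + size // 2] = a - b
                w *= w_m
        size *= 2
    return x

print(fft([1, 0, 0, 0]))  # impulse -> flat spectrum of ones
```

Each pass touches every element exactly once, which is why splitting stages between PIM banks (near the data) and the GPU (for the shuffle-heavy stages) is the natural collaborative division the paper argues for.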
Egalitarian ORAM: Wear-Leveling for ORAM
While non-volatile memories (NVMs) provide several desirable characteristics,
like better density than DRAM, comparable energy efficiency, DRAM-like
performance, and disk-like durability, the limited endurance they manifest
remains a challenge with these memories. Indeed, the endurance constraints of
NVMs can prevent solutions that are commonly employed for other mainstream
memories like DRAM from being carried over as-is to NVMs. Specifically, in this
work we observe that the Oblivious RAM (ORAM) primitive, the state-of-the-art
solution to the memory bus side channel vulnerability, while widely studied
for DRAM, is particularly challenging to implement as-is for NVMs, as it
severely affects their endurance. This is because the inherent nature of the
ORAM primitive causes an order-of-magnitude increase in write traffic and,
furthermore, causes some regions of memory to be written far more often than
others. This non-uniform write traffic as manifested by the ORAM primitive
stands to severely affect the lifetime of non-volatile memories (1% of the
baseline without ORAM), even making it impractical to address this security
vulnerability.
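The non-uniformity is easy to reproduce with a toy model of a tree-based (Path-ORAM-style) design in which every access rewrites one root-to-leaf path. The sketch below is for intuition only and does not model Egalitarian ORAM's wear-leveling itself.

```python
import random

# Toy model of why a tree-based (Path-ORAM-style) design wears NVM unevenly:
# every access rewrites one root-to-leaf path, so the root bucket is written
# on *every* access while each leaf bucket is written only ~1/num_leaves of
# the time. This sketches the problem, not Egalitarian ORAM's wear-leveling.

def path_writes(levels, accesses, seed=0):
    """Count per-bucket writes in a complete binary tree (node 0 = root)."""
    rng = random.Random(seed)
    writes = [0] * (2 ** levels - 1)
    for _ in range(accesses):
        leaf = rng.randrange(2 ** (levels - 1))  # uniformly random leaf
        node = 2 ** (levels - 1) - 1 + leaf      # array index of that leaf
        while True:                              # rewrite the whole path
            writes[node] += 1
            if node == 0:
                break
            node = (node - 1) // 2
    return writes

if __name__ == "__main__":
    w = path_writes(levels=10, accesses=10_000)
    print(w[0])          # root bucket: one write per access -> 10000
    print(max(w[511:]))  # busiest of the 512 leaf buckets: far fewer writes
```

With 10 levels, the root absorbs every one of the 10,000 writes while the 512 leaves split 10,000 writes among them, an endurance skew of roughly two orders of magnitude between tree levels.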
Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures
Continual demand for memory bandwidth has made it worthwhile for memory
vendors to reassess processing in memory (PIM), which enables higher bandwidth
by placing compute units in/near-memory. As such, memory vendors have recently
proposed commercially viable PIM designs. However, these proposals are largely
driven by the needs of (a narrow set of) machine learning (ML) primitives.
While such proposals are reasonable given the growing importance of ML, as
memory is a pervasive component, there is a case for a more inclusive PIM
design that can accelerate primitives across domains.
In this work, we ascertain the capabilities of commercial PIM proposals to
accelerate various primitives across domains. We begin by outlining a set of
characteristics, termed the PIM-amenability-test, which aid in assessing
whether a given primitive is likely to be accelerated by PIM. Next, we apply
this test to the primitives under study and ascertain efficient data placement
and orchestration to map them to the underlying PIM architecture. We observe
that, even though the primitives under study are largely PIM-amenable,
existing commercial PIM proposals do not realize their performance potential
for these primitives. To address this, we identify bottlenecks that arise in
PIM execution and propose hardware and software optimizations which stand to
broaden the acceleration reach of commercial PIM designs (improving average
PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while
we believe emerging commercial PIM proposals add a necessary and complementary
design point in the application-acceleration space, hardware-software
co-design is necessary to deliver their benefits broadly.
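The abstract names the PIM-amenability-test but not its exact criteria; as a hedged stand-in, one common first-order check is roofline-style: a primitive stands to gain from PIM's bandwidth boost when it is memory-bandwidth bound on the host. A sketch, with all numbers illustrative rather than from the paper:

```python
# Hedged stand-in: the abstract names a "PIM-amenability-test" without
# listing its criteria, so this sketch applies a common roofline-style
# first-order check -- a primitive stands to gain from PIM's bandwidth boost
# when it is memory-bandwidth bound on the host processor.

def is_bandwidth_bound(flops, bytes_moved, peak_flops, peak_bw):
    """True when arithmetic intensity (flops per byte) is below the host's
    ridge point, i.e. runtime is dominated by data movement."""
    return (flops / bytes_moved) < (peak_flops / peak_bw)

# Illustrative numbers, not from the paper: a streaming vector add does
# ~1 flop per 12 bytes moved; a host with 20 TFLOP/s and 1 TB/s of bandwidth
# has a ridge point of 20 flops/byte, so the add is a PIM candidate, whereas
# a dense matmul at ~100 flops/byte is compute bound and stays on the GPU.
print(is_bandwidth_bound(1, 12, peak_flops=20e12, peak_bw=1e12))   # True
print(is_bandwidth_bound(100, 1, peak_flops=20e12, peak_bw=1e12))  # False
```

A real test would also weigh data-placement constraints (whether operands can be partitioned across PIM banks without inter-bank traffic), which this one-line check deliberately omits.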